AITopics | raw corpus

Collaborating Authors

raw corpus

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

BitAbuse: A Dataset of Visually Perturbed Texts for Defending Phishing Attacks

Lee, Hanyong, Lee, Chaelyn, Lee, Yongjae, Lee, Jaesung

arXiv.org Artificial IntelligenceFeb-6-2025

Phishing often targets victims through visually perturbed texts to bypass security systems. The noise contained in these texts functions as an adversarial attack, designed to deceive language models and hinder their ability to accurately interpret the content. However, since it is difficult to obtain sufficient phishing cases, previous studies have used synthetic datasets that do not contain real-world cases. In this study, we propose the BitAbuse dataset, which includes real-world phishing cases, to address the limitations of previous research. Our dataset comprises a total of 325,580 visually perturbed texts. The dataset inputs are drawn from the raw corpus, consisting of visually perturbed sentences and sentences generated through an artificial perturbation process. Each input sentence is labeled with its corresponding ground truth, representing the restored, non-perturbed version. Language models trained on our proposed dataset demonstrated significantly better performance compared to previous methods, achieving an accuracy of approximately 96%. Our analysis revealed a significant gap between real-world and synthetic examples, underscoring the value of our dataset for building reliable pre-trained models for restoration tasks. We release the BitAbuse dataset, which includes real-world phishing cases annotated with visual perturbations, to support future research in adversarial attack defense.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2502.05225

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
(5 more...)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification

Peng, Letian, Gu, Yi, Dong, Chengyu, Wang, Zihan, Shang, Jingbo

arXiv.org Artificial IntelligenceJun-16-2024

For extremely weak-supervised text classification, pioneer research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may end up with very limited or even no samples for the minority classes. Recent works have started to generate the relevant texts by prompting LLMs using the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution (i.e., similar to the corpus where the text classifier will be applied) data, leading to ungeneralizable classifiers. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, \emph{text grafting}, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus, which have a high potential for data synthesis into the target minority class. Then, the templates are filled by state-of-the-art LLMs to synthesize near-distribution texts falling into minority classes. Text grafting shows significant improvement over direct mining or synthesis on minority classes. We also use analysis and case studies to comprehend the property of text grafting.

computational linguistic, minority class, synthesis, (15 more...)

arXiv.org Artificial Intelligence

2406.11115

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > Dominican Republic (0.04)
(16 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

LLM-enhanced Self-training for Cross-domain Constituency Parsing

Li, Jianling, Zhang, Meishan, Guo, Peiming, Zhang, Min, Zhang, Yue

arXiv.org Artificial IntelligenceNov-5-2023

Self-training has proven to be an effective approach for cross-domain tasks, and in this study, we explore its application to cross-domain constituency parsing. Traditional self-training methods rely on limited and potentially low-quality raw corpora. To overcome this limitation, we propose enhancing self-training with the large language model (LLM) to generate domain-specific raw corpora iteratively. For the constituency parsing, we introduce grammar rules that guide the LLM in generating raw corpora and establish criteria for selecting pseudo instances. Our experimental results demonstrate that self-training for constituency parsing, equipped with an LLM, outperforms traditional methods regardless of the LLM's performance. Moreover, the combination of grammar rules and confidence criteria for pseudo-data selection yields the highest performance in the cross-domain constituency parsing.

constituency, grammar rule, target domain, (15 more...)

arXiv.org Artificial Intelligence

2311.0266

Country:

Asia > China > Tianjin Province > Tianjin (0.04)
Asia > China > Heilongjiang Province > Harbin (0.04)
Asia > China > Guangdong Province > Shenzhen (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Grammars & Parsing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Information Extraction based on Named Entity for Tourism Corpus

Chantrapornchai, Chantana, Tunsakul, Aphisit

arXiv.org Artificial IntelligenceJan-3-2020

Tourism information is scattered around nowadays. To search for the information, it is usually time consuming to browse through the results from search engine, select and view the details of each accommodation. In this paper, we present a methodology to extract particular information from full text returned from the search engine to facilitate the users. Then, the users can specifically look to the desired relevant information. The approach can be used for the same task in other domains. The main steps are 1) building training data and 2) building recognition model. First, the tourism data is gathered and the vocabularies are built. The raw corpus is used to train for creating vocabulary embedding. Also, it is used for creating annotated data. The process of creating named entity annotation is presented. Then, the recognition model of a given entity type can be built. From the experiments, given hotel description, the model can extract the desired entity,i.e, name, location, facility. The extracted data can further be stored as a structured information, e.g., in the ontology format, for future querying and inference. The model for automatic named entity identification, based on machine learning, yields the error ranging 8%-25%.

information, ontology, raw corpus, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/JCSSE.2019.8864166

2001.01588

Country:

Asia > Thailand > Bangkok > Bangkok (0.05)
Europe > Austria > Vienna (0.05)
Europe > Austria > Tyrol > Innsbruck (0.04)
(14 more...)

Genre: Research Report (0.41)

Industry: Consumer Products & Services > Travel (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback